1. INTRODUCTION 

Hearing-impaired is a generally accepted term for someone who has a hearing loss. It 
may refer to someone who was born deaf or who became deaf later in life. Hearing-impaired individuals face 
communication challenges, especially with non-hearing-impaired individuals, because their basic means of 
communication are sign language, body language, and facial gestures. A major consideration 
in communication between hearing-impaired and non-hearing-impaired people borders on how to aid, 
facilitate, and enhance effective communication between them. Over 5% of the world’s population is hearing 
impaired, and it is estimated that by 2050 over 700 million people, or one in every ten, will have 
disabling hearing loss [1]. About 23.7% of Nigerians are estimated to have a hearing impairment [2], which 
translates to about 24 persons in every 100 of the total population. Communication between hearing-impaired 
and non-hearing-impaired persons can be considered a more serious problem than communication 
with visually impaired people. While non-hearing-impaired individuals communicate freely in spoken 
languages and listen to audio messages, non-verbal communication constitutes the main mode of 
sharing information among hearing-impaired people. 

Hearing-impaired persons communicate mainly through sign language (SL), which includes the use 
of facial gestures, movements of certain body parts, and lip-reading. SL is country-specific, which makes its 
translation a difficult task [3]. The most important thing is getting the attention of the hearing-impaired person, 
which can be done in several ways, such as touching the arm or shoulder, waving, stamping on the floor, or 
switching the light on and off [4]. The choice and effectiveness of a method for catching the attention of a 
hearing-impaired person depend on how familiar the two parties in the communication process are with each other. 

Given the number of hearing-impaired people around us, their inability to convey their thoughts 
constitutes a major problem, especially in areas where they are in the minority. Effective communication is 
required for peaceful co-existence and the general well-being of society at large. In view of this, SLs were 
developed to facilitate effective communication between hearing-impaired persons. To a large extent, this 
measure has addressed the problem of communication among educated hearing-impaired individuals. 
However, communication between hearing-impaired and non-hearing-impaired persons remains unaddressed. Hence, 
there is a need for an arbiter between these two sets of people in order to forestall avoidable crises that may 
emanate from a communication gap. Essentially, two key issues are involved in arriving at such an 
arbiter: sign language recognition (SLR) and sign language translation (SLT). 

Research into recognizing and interpreting SL gestures started a few years back. SLR schemes can 
be grouped into three, depending on the form they take: schemes that rely on the recognition of 
finger-spelling, schemes based on the recognition of isolated words, and schemes based on the recognition of 
continuous sentences [5]. Many of the earlier research efforts on SLR adopted 
traditional recognition approaches, which include the hidden Markov model [6] for word recognition, 
support vector machines [7], [8] for classification of both isolated words and continuous SL alphabets, and 
trajectory matching for grouping isolated words. Recently, deep-learning methods such as the 
convolutional neural network (CNN) [9], [10] and long short-term memory (LSTM) [11]-[14] have been utilized 
singly and in hybrid configurations [15], [16] to address the problem of SLR, especially in applications 
involving the recognition of continuous sentence structure. 

Unlike the situation with SLR, reports on SLT are scanty in the literature. However, knowledge 
gained from research into SLR provides ample leverage for SLT development to facilitate effective 
communication between hearing-impaired and non-hearing-impaired persons in our society. To that end, 
the few SLT proposals that can be found in the literature include deep learning model-based SLT [11], [15], 
[17]-[20] and sensor-based SLT [21]-[35]. The majority of the neural SLT models adopt a multimodal structure 
in which a CNN is connected in sequence to a neural machine translation 
(NMT) module. While the NMT module is essentially the kernel for the translation of SL gestures into target 
sentences, the CNN extracts the image-level features that serve as NMT input. The critical 
problem with all deep learning-based SLT models is the requirement for a large dataset, which is not 
readily available. This requirement hinders the performance of the resulting SLT models. Sensor-based SLTs are 
usually built around a glove fitted with microtouch switches or other electronic devices for gesture 
recognition and translation into text and speech equivalents. 

Other researchers have developed different SLT-related systems, such as a software-based platform [36], a 
MobileNetV2-based gesture recognition system [37], and a tablet-based hearing aid [38]. The open-source software 
framework developed in [36] presents a development environment for building augmentative and 
alternative communication models, including communication aids for the disabled community. 
Although the MobileNetV2-based gesture recognition system developed in [37] is specifically 
meant for smart home applications, it can also be deployed for gesture translation. Cameras of mobile 
devices are used to detect and capture data from objects, with each detected object marked in the frame by a 
bounding box. The device developed in [38] is a tablet-based hearing aid in which sign language 
gestures are digitally processed by the tablet before being wirelessly relayed to standard earphones for better 
output. 

This work proposes a PC-based communication aid to facilitate effective communication between 
hearing-impaired and non-hearing-impaired people. American sign language (ASL) gestures are employed in 
the development: a database of ASL hand gestures is created using Python scripts, while the 
pipeline configuration model for machine learning on the annotated gesture images in the database 
is realized with TensorFlow (TF). The developed SLT is implemented in a Python software environment and 
runs on a PC equipped with a web camera that captures real-time gestures for comparison and interpretation. 
The outputs of the developed SLT are translations of ASL gestures into written 
text and corresponding audio renderings, produced in about one second on average. The novelty of this 
paper is premised on the fact that no new device is introduced, which reduces cost in the long run. 
The personal computer is a common tool nowadays, available to different individuals 
(hearing-impaired and non-hearing-impaired alike) who mingle and interact. While sensor-based 
SLTs [21], [23]-[33], [35] tend to be costlier, the PC-based SLT proposed in this paper is 
relatively cheap, as no separate gadgets are required. In addition, no additional training is needed, as the 
device can be operated by whoever can handle a computer system. Specifically, the work involves the creation of a 
database of ASL gestures using a Python script, which enables faster further signal processing, and the development of a TF 
pipeline configuration model that interfaces with the created database for machine learning. 

The rest of the paper is structured as follows. Presented in section 2 is the method while section 3 
has results and discussion. The paper is concluded in section 4. 


2. METHOD 

The developed communication aid between hearing-impaired and non-hearing-impaired individuals 
works by converting ASL gestures made by the hearing-impaired person into corresponding text and audio signals for 
the non-hearing-impaired person to interpret. The development involves three stages: i) creation 
of a database of images for each of the selected ASL gestures using a Python script; ii) creation of a 
pipeline configuration model using TF to interface with the created database for machine learning; 
and iii) deployment of the TF pipeline configuration model in a Python software environment for matching, 
comparison, and decision making on real-time images of ASL gestures and production of the corresponding 
text and audio. To realize these steps, a few materials (hardware and software) are 
required. They are briefly highlighted in what follows, beginning with the hardware materials. 


2.1. Hardware materials 

Hardware materials used in this study include: i) a 64-bit laptop with 4.00 gigabytes (GB) of random access memory 
(RAM) and a 2.60 gigahertz (GHz) processor, running the Windows 10 operating system: the laptop is 
used as the workbench for the developed communication aid and houses its software component 
for deployment and utilization; ii) a 12.0-megapixel high-definition (HD) webcam with 1080p video recording: 
this is used to capture real-time images of the sign language gestures made in front of it. 
The webcam is connected to the laptop through a universal serial bus (USB) port, and images captured by 
the webcam are saved to the laptop; iii) a SanDisk 8 GB micro secure digital (SD) memory card: the memory 
card is used to compile the selected images of the gestures and is slotted into the SD 
port of the laptop to extract the images; iv) other hardware, such as speakers, provided as peripherals to the 
computer to enhance audio rendering of the gesture images; and v) sign language gestures: specifically, 
ASL gesture images are employed in the development of the communication aid reported in this paper. 


2.2. Software components 

Software components used in this study include: i) Python v3.8, a popular open-source 
programming language. It is used as the model building and deployment environment for the 
developed communication aid. The choice is informed by its versatility and ease of use for scripting 
applications in a wide variety of domains; ii) LabelImg, a graphical image annotation tool for labeling object 
bounding boxes in images. It is written in Python and is used here to create bounding-box annotations of the gesture 
images. The created annotations are saved in CreateML formats; iii) Pyttsx3, adopted as the text-to-speech 
conversion library for this work. It is a Python library and is chosen because it works offline, unlike 
the available alternatives, which work mainly online; iv) TF, a software library or framework designed by 
the Google team for easier and faster implementation of machine learning and deep learning concepts. Core 
functionalities of TF that favour its choice for this work are: augmented tensor operations with seamless 
interfaces to existing programs; automatic differentiation, which is at the very core of optimization-based 
algorithms; and parallel and distributed (multi-machine) computing. TF is used to create a 
pipeline configuration model (an application interface) to facilitate easier detection of objects. The pipeline 
configuration file is split into five parts: model configuration, train configuration, evaluation configuration, 
train input configuration, and evaluation input configuration; v) Mobilenet_v2_SSD, an object detector that 
can be used on real-time images to locate objects. Detected locations are described by bounding 
boxes, with each bounding box assigned a class; vi) Jupyter server, an extension in the Python software 
environment that extends the console-based approach to interactive computing in a qualitatively new 
direction, providing a web-based application suitable for capturing the whole computation process: 
developing, documenting, and executing code, as well as communicating the results. The Jupyter server 
combines two components: a web application and notebook documents. For this work, the web application 
component is used; and vii) DeepStack server, an artificial intelligence (AI) server that enables development 
of faster AI systems both on premises and in the cloud. DeepStack runs on the Docker platform but can be 
used from any programming language. 


2.3. Procedure 

Having highlighted the needed materials (both hardware and software), the next step is to 
describe the procedure followed in developing the communication aid. A Python 
script is written to compile and create the database of ASL gestures. The script uses 
OpenCV, a Python library for computer vision and imaging, to facilitate interaction between the laptop and 
the webcam. This interaction allows snapshots of ASL gestures taken by the webcam to be stored for further 
processing. Figure 1 shows some of the ASL gestures used in the creation of the database, with their 
corresponding meanings: (a) hello, (b) I love you, (c) nice to 
meet you, (d) no, (e) please, and (f) sorry. Figure 2 depicts a snapshot of the ASL database created. 
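The capture step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used in this work: the label names, the `dataset` folder, the `<label>_<index>.jpg` naming pattern, the number of images per gesture, and the delay between snapshots are all hypothetical. Only standard OpenCV calls (`VideoCapture`, `read`, `imwrite`, `release`) are used.

```python
import time
from pathlib import Path

# Hypothetical label names; the paper does not list the labels used on disk.
GESTURES = ["hello", "i_love_you", "no", "please", "sorry"]


def image_path(root, label, index):
    """Return the path for one snapshot, e.g. dataset/hello_0.jpg."""
    return Path(root) / f"{label}_{index}.jpg"


def capture_gesture_images(root="dataset", per_gesture=15, delay_s=2.0, camera=0):
    """Save `per_gesture` webcam snapshots for every label in GESTURES."""
    import cv2  # imported lazily so image_path() works without OpenCV installed

    Path(root).mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(camera)
    if not cap.isOpened():
        raise RuntimeError("webcam could not be opened")
    try:
        for label in GESTURES:
            print(f"Show the gesture for '{label}'")
            for index in range(per_gesture):
                time.sleep(delay_s)  # give the signer time to adjust the pose
                ok, frame = cap.read()
                if not ok:
                    raise RuntimeError("failed to read a frame from the webcam")
                cv2.imwrite(str(image_path(root, label, index)), frame)
    finally:
        cap.release()  # always free the camera, even after an error
```

Calling `capture_gesture_images()` on a machine with a webcam would fill the folder with one set of images per gesture, ready for annotation.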

As described earlier, LabelImg is then used to annotate objects in the created ASL 
database. Figure 3 illustrates a typical LabelImg annotation, specifically for the gesture image 
“hello”. The pipeline configuration model for machine learning is built using TF. The 
pipeline is used to convert the annotations and the created ASL database into TF record format for machine 
learning. For object detection, Mobilenet_v2_SSD is used in conjunction with the created TF pipeline. 
Figure 4 shows an extract from the TF pipeline created. 
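A pipeline of this kind normally needs a label map that ties each class name to an integer id. The sketch below assumes the pipeline follows the TF Object Detection API convention, in which ids start at 1 because 0 is reserved for the background class. The label strings are hypothetical stand-ins for the nine gestures evaluated later in the paper.

```python
# Hypothetical label strings for the nine gestures evaluated in this paper.
LABELS = [
    "yes", "please", "nice to meet you", "hello", "I love you",
    "you are welcome", "thanks", "sorry", "no",
]


def build_label_map(labels):
    """Render class names as label-map text; ids start at 1 (0 is the background)."""
    items = []
    for class_id, name in enumerate(labels, start=1):
        items.append(f"item {{\n  id: {class_id}\n  name: '{name}'\n}}\n")
    return "\n".join(items)
```

Writing `build_label_map(LABELS)` to a text file would give nine entries, matching a `num_classes` value of 9 in the pipeline configuration.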




Figure 1. Typical ASL gestures used in the creation of the database with corresponding meanings (a) hello, 
(b) I love you, (c) nice to meet you, (d) no, (e) please, and (f) sorry 

Figure 2. The snapshot of the ASL gesture images database created 

<annotation>
  <folder>dataset</folder>
  <filename>hello_0.jpg</filename>
  <path>/home/lalah/Downloads/Sign Language Dataset/dataset_TFOD/dataset_TFOD/dataset/hello_0.jpg</path>
  <source>
    <database>Unknown</database>
  </source>
  <size>
    <width>640</width>
    <height>480</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>hello</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>116</xmin>
      <ymin>159</ymin>
      <xmax>207</xmax>
      <ymax>348</ymax>
    </bndbox>
  </object>
</annotation>


Figure 3. Snapshot of the LabelImg annotation for the gesture image “hello” in the database 
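The annotation in Figure 3 follows the Pascal VOC XML layout, so it can be read with the Python standard library before conversion to TF record format. The sketch below is illustrative only: the sample string is a shortened copy of Figure 3, the image size is assumed to be 640x480 (the box's ymax of 348 requires a height of at least that), and scaling the box to the 0..1 range reflects the usual convention for object-detection records rather than a step confirmed by this paper.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the Figure 3 annotation; the 640x480 size is an assumption.
SAMPLE = """<annotation>
  <filename>hello_0.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>hello</name>
    <bndbox><xmin>116</xmin><ymin>159</ymin><xmax>207</xmax><ymax>348</ymax></bndbox>
  </object>
</annotation>"""


def read_annotation(xml_text):
    """Return (filename, [(label, xmin, ymin, xmax, ymax), ...]) with boxes scaled to 0..1."""
    root = ET.fromstring(xml_text)
    width = float(root.findtext("size/width"))
    height = float(root.findtext("size/height"))
    boxes = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        boxes.append((
            obj.findtext("name"),
            float(box.findtext("xmin")) / width,
            float(box.findtext("ymin")) / height,
            float(box.findtext("xmax")) / width,
            float(box.findtext("ymax")) / height,
        ))
    return root.findtext("filename"), boxes
```

For the sample, `read_annotation(SAMPLE)` returns the file name together with one `hello` box whose corners are expressed as fractions of the image width and height.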


pipeline.config

model {
  ssd {
    num_classes: 9
    image_resizer {
      fixed_shape_resizer {
        height: 320
        width: 320
      }
    }
    ...


Figure 4. Extract from the created TF pipeline configuration model 
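To make the five-part layout of the pipeline configuration file concrete, the sketch below renders a skeleton with one block per part. The top-level block names follow the TF Object Detection API, which the extract in Figure 4 appears to use. Every path, the batch size, the metric set, and the checkpoint name are hypothetical placeholders, and a trainable SSD configuration needs many more fields (feature extractor, box coder, loss, optimizer) than are shown here.

```python
def pipeline_skeleton(num_classes, size, label_map, train_record, eval_record, checkpoint):
    """Return pipeline.config text with the five top-level parts filled in minimally."""
    return f"""model {{
  ssd {{
    num_classes: {num_classes}
    image_resizer {{
      fixed_shape_resizer {{
        height: {size}
        width: {size}
      }}
    }}
  }}
}}
train_config {{
  batch_size: 4
  fine_tune_checkpoint: "{checkpoint}"
}}
train_input_reader {{
  label_map_path: "{label_map}"
  tf_record_input_reader {{
    input_path: "{train_record}"
  }}
}}
eval_config {{
  metrics_set: "coco_detection_metrics"
}}
eval_input_reader {{
  label_map_path: "{label_map}"
  tf_record_input_reader {{
    input_path: "{eval_record}"
  }}
}}
"""
```

A call such as `pipeline_skeleton(9, 320, "label_map.pbtxt", "train.record", "test.record", "ckpt-0")` reproduces the values visible in Figure 4 (nine classes, 320x320 input) inside the model block.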


The TF pipeline configuration model is trained on the created database and its annotations in the Python 
software environment. The model is first deployed on the Jupyter server, then on the DeepStack server. Both 
deployments run on port 80 and simply point to the saved model's directory. The 
model compares, matches, and makes final decisions on the real-time images of ASL gestures displayed by the gesturer. It 
also displays the bounding-box coordinates together with a precision value that ranges from 0 to 1. An evaluation 
script that uses OpenCV facilitates interaction with the webcam, while Pyttsx3 converts the 
text equivalents of the translated sign gestures into the corresponding audio in real time. 
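The decision and speech steps can be sketched as below. The prediction format (a list of dictionaries with `label` and `confidence` keys) is an assumption modelled on the kind of response a detection server returns. The 0.4 threshold and the rule of not repeating the same word on consecutive frames are illustrative choices, not settings reported in this paper. The `pyttsx3` calls used (`init`, `say`, `runAndWait`) are the library's basic offline interface.

```python
def best_prediction(predictions, threshold=0.4):
    """Return (label, confidence) of the most confident detection, or None below threshold."""
    if not predictions:
        return None
    top = max(predictions, key=lambda p: p["confidence"])
    if top["confidence"] < threshold:
        return None
    return top["label"], top["confidence"]


class Announcer:
    """Speak a label only when it differs from the one spoken last."""

    def __init__(self, speak):
        self._speak = speak  # any callable that takes the text to voice
        self._last = None

    def announce(self, label):
        if label == self._last:
            return False  # same gesture as the previous frame: stay silent
        self._last = label
        self._speak(label)
        return True


def pyttsx3_speak(text):
    """Offline text-to-speech with pyttsx3."""
    import pyttsx3  # lazy import so the helpers above run without the library

    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()
```

In a live loop, `Announcer(pyttsx3_speak).announce(label)` would voice each newly recognized gesture once, while the text overlay keeps updating on every frame.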


3. RESULTS AND DISCUSSION 

Findings from the deployment of the developed SLT are presented here, together with its response time 
when real-time gestures are made in front of a PC running the developed communication aid. To that end, 
the discussion is grouped into two parts, beginning with the process of engaging the developed SLT. 


3.1. Engaging the developed SLT 

A gesturer is positioned facing the webcam of the PC on which the developed virtual communication 
aid is running. The following instruction is entered at the command prompt on the PC: 

    deepstack --MODELSTORE-DETECTION "C:\Users\USER PC\Desktop\My project\fine_tuned_model (docker)\ObjectDetection\models" --PORT 80

To start the webcam video stream, this instruction is followed by: 

    cd "C:\Users\USER PC\Desktop\My project\fine_tuned_model (docker)\ObjectDetection"
    python livefeed_detection.py

Successful operation and correct entry of the instructions lead to the display of the properties of the 
real-time image in front of the webcam and the matching of the image with its equivalent in the database to give 
the appropriate outputs (displayed text and audio rendering). In addition, a value is displayed on the PC screen to 
indicate the accuracy of the matching process. Pressing the letter “Q” on the keyboard halts the live feed. 
Figure 5 shows a snapshot of the first and second instructions as entered at the command prompt. 

    Microsoft Windows [Version 10.0.17763.1577]
    (c) 2018 Microsoft Corporation. All rights reserved.

    ...Detection\models" --PORT 80
    DeepStack: Version 2021.02.1
    /v1/restore

    C:\Users\USER PC>cd "C:\Users\USER PC\Desktop\My project\fine_tuned_model (docker)\ObjectDetection"

    C:\Users\USER PC\Desktop\My project\fine_tuned_model (docker)\ObjectDetection>python livefeed_detection.py
    [INFO] starting video stream...


Figure 5. Snapshot of the first and second instructions entered at command prompt to initiate detection and 
streaming of the gesturer in the front of the PC camera 
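The contents of `livefeed_detection.py` are not listed in this paper, so the following is only a sketch of what such a live-feed loop might look like. The endpoint path, the model name `asl`, and the response keys (`predictions`, `label`, `confidence`, `x_min`, `y_min`, `x_max`, `y_max`) are assumptions based on DeepStack's custom-model HTTP interface. The window title and drawing colours are arbitrary.

```python
DEEPSTACK_URL = "http://localhost:80/v1/vision/custom/asl"  # "asl" is a hypothetical model name


def overlay_text(label, confidence):
    """Format the caption drawn above a bounding box, e.g. 'hello 0.58'."""
    return f"{label} {confidence:.2f}"


def is_quit_key(key_code):
    """True when the code returned by cv2.waitKey corresponds to 'q' or 'Q'."""
    return key_code != -1 and chr(key_code & 0xFF) in ("q", "Q")


def run_live_feed(url=DEEPSTACK_URL, camera=0):
    """Send webcam frames to the detection server until Q is pressed."""
    import cv2       # lazy imports: the two helpers above need neither library
    import requests

    cap = cv2.VideoCapture(camera)
    print("[INFO] starting video stream...")
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            encoded, jpeg = cv2.imencode(".jpg", frame)
            if not encoded:
                continue
            reply = requests.post(url, files={"image": jpeg.tobytes()}, timeout=10).json()
            for p in reply.get("predictions", []):
                top_left = (p["x_min"], p["y_min"])
                cv2.rectangle(frame, top_left, (p["x_max"], p["y_max"]), (0, 255, 0), 2)
                cv2.putText(frame, overlay_text(p["label"], p["confidence"]), top_left,
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
            cv2.imshow("ASL translator", frame)
            if is_quit_key(cv2.waitKey(1)):
                break
    finally:
        cap.release()
        cv2.destroyAllWindows()
```

With the DeepStack server already listening on port 80, calling `run_live_feed()` would open a preview window and annotate each frame until the Q key is pressed.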


Once the two instructions have been entered at the command prompt, the real-time image of the gesturer in 
front of the camera is processed and the sign language is translated 
into the appropriate outputs (displayed text and audio). Figure 6 shows the results obtained when nine gestures are 
presented in front of the PC running the developed SLT: (a) yes, (b) please, (c) nice to meet you, (d) hello, 
(e) I love you, (f) you are welcome, (g) thanks, (h) sorry, and (i) no. As can be observed from Figure 6, in addition to 
the displayed text equivalents of the ASL gestures, numerical values between 0 and 1 
are shown. These values indicate the precision associated with each translated ASL gesture. 


Figure 6. Matching and translation of some ASL gestures using the developed PC-based sign language 
translator (a) yes, (b) please, (c) nice to meet you, (d) hello, (e) I love you, (f) you are welcome, (g) thanks, 
(h) sorry, and (i) no 


3.2. Time elapsed for inference 

Table 1 presents the time taken by the developed PC-based SLT to render selected ASL 
gestures, presented in front of the PC on which it is deployed, into the corresponding text and audio interpretations. 
The translation precision values captured for each of the selected ASL gestures are also presented. 
The results in Table 1 show that the developed PC-based sign language translator achieves its 
aim: each of the nine ASL gestures used for the evaluation is successfully 
translated into its word equivalent in about one second on average. This indicates that the developed sign language 
translator is suitable as a communication aid between hearing-impaired and non-hearing-impaired individuals. 


Table 1. Duration of time for the inference on prediction of gestures and translation precision 


Interpreted gesture | Minimum duration (s) | Maximum duration (s) | Average duration (s) | Translation precision (%)
--------------------|----------------------|----------------------|----------------------|--------------------------
Yes                 | 0.7594               | 1.1143               | 0.9369               | 79.9578
Please              | 0.7325               | 1.3072               | 1.0199               | 75.1198
Nice to meet you    | 0.7945               | 1.2642               | 1.0294               | 51.1127
Hello               | 0.7725               | 1.2792               | 1.0259               | 57.7864
I love you          | 0.8685               | 1.3692               | 1.1189               | 44.6005
You are welcome     | 0.9114               | 1.4731               | 1.1923               | 51.7557
Thank you           | 0.8795               | 1.2492               | 1.0644               | 61.2952
Sorry               | 0.8885               | 1.2392               | 1.0639               | 75.6436
No                  | 0.8885               | 1.2392               | 1.0639               | 76.5432


The duration for processing each of the ASL gestures used in the evaluation is generally about one 
second, which shows that the translation is done in real time, without the kind of delay that could introduce frustration into 
the communication process. The precision value attached to each gesture indicates 
how accurately the real-time images are matched with those in the annotated database. The lowest precision in the set of 
ASL gestures used for the evaluation is about 44% (corresponding to 
“I love you”), while the highest is approximately 80% (corresponding to “yes”). These results indicate the robustness of 
the developed sign language translator, for it detects and correctly translates gestures into the required outputs 
(text and audio rendering) even when the matching between the real-time image and the database image is only about 44%. 
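The figures quoted above can be checked directly against Table 1. The short script below copies the table rows and computes the overall mean response time, the mean precision, and the gestures with the lowest and highest precision. The helper names are illustrative. The last function tests an observation about the table rather than a statement made in the text: each reported average coincides with the midpoint of the minimum and maximum durations.

```python
from statistics import mean

# (gesture, minimum s, maximum s, average s, precision %) copied from Table 1
TABLE_1 = [
    ("Yes", 0.7594, 1.1143, 0.9369, 79.9578),
    ("Please", 0.7325, 1.3072, 1.0199, 75.1198),
    ("Nice to meet you", 0.7945, 1.2642, 1.0294, 51.1127),
    ("Hello", 0.7725, 1.2792, 1.0259, 57.7864),
    ("I love you", 0.8685, 1.3692, 1.1189, 44.6005),
    ("You are welcome", 0.9114, 1.4731, 1.1923, 51.7557),
    ("Thank you", 0.8795, 1.2492, 1.0644, 61.2952),
    ("Sorry", 0.8885, 1.2392, 1.0639, 75.6436),
    ("No", 0.8885, 1.2392, 1.0639, 76.5432),
]


def summarize(rows):
    """Return mean latency, mean precision, and the extreme-precision gestures."""
    return {
        "mean_latency_s": mean(r[3] for r in rows),
        "mean_precision_pct": mean(r[4] for r in rows),
        "lowest": min(rows, key=lambda r: r[4])[0],
        "highest": max(rows, key=lambda r: r[4])[0],
    }


def average_is_midpoint(rows, tol=1e-4):
    """Check whether each reported average equals (minimum + maximum) / 2 within tol."""
    return all(abs((lo + hi) / 2 - avg) <= tol for _, lo, hi, avg, _ in rows)
```

Run on the table, `summarize(TABLE_1)` gives a mean response time of roughly 1.06 s and a mean precision of roughly 63.8%, with "I love you" lowest and "Yes" highest, in line with the discussion above.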

It is worth pointing out that the PC-based SLT proposed in this paper takes less time to 
respond than most sensor-based SLTs, which need a few seconds for 
recognition. For instance, it is reported in [28] that the detection of hand motions took a few seconds, as the 
user had to hold a formed sign for two seconds to ensure recognition. This is in addition to the lower cost of 
implementation, acquisition, and operation of the proposed SLT, since it is simply an ‘add-in’ to a typical PC. 
Table 2 further summarizes the comparison of the PC-based SLT proposed in this paper with two others that can 
be found in the literature. 


Table 2. Comparison of the performance of different SLTs 


Parameter     | Sensor and microcontroller-based                              | Neural-based               | Proposed PC-based
--------------|---------------------------------------------------------------|----------------------------|------------------------------------------------------
Cost          | Moderate cost                                                 | Moderate                   | Relatively cheap as no need for other gadgets
Operation     | Requires training for operation                               | Requires hands-on training | No special training besides basic computer operation
Response time | Few seconds required                                          | Longer time                | Average of one-second response time
Database      | Limited owing to the size of the microcontroller memory space | Large                      | Large


4. CONCLUSION 

A PC-based sign language translator has been developed in this paper. The developed SLT is shown 
to successfully translate nine different ASL gestures into text and audio equivalents. These written and 
audio renderings of sign language gestures will go a long way in facilitating effective 
communication between hearing-impaired and non-hearing-impaired individuals in our society when 
deployed. Hence, the translator provides a method of addressing the problem of communication between these 
individuals and aids their mutual interactions in daily activities. 